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ABSTRACT 

We collected a massive and heterogeneous 
dataset of 20255 gene expression profiles (GEPs) 
from a variety of human samples and experimental 
conditions, as well as 8895 GEPs from mouse 
samples. We developed a mutual information (Ml) 
reverse-engineering approach to quantify the 
extent to which the mRNA levels of two genes are 
related to each other across the dataset. The result- 
ing networks consist of 4817629 connections 
among 20255 transcripts in human and 14461095 
connections among 45101 transcripts in mouse, 
with a inter-species conservation of 12%. The 
inferred connections were compared against 
known interactions to assess their biological signifi- 
cance. We experimentally validated a subset of not 
previously described protein-protein interactions. 
We discovered co-expressed modules within the 
networks, consisting of genes strongly connected 
to each other, which carry out specific biological 
functions, and tend to be in physical proximity at 
the chromatin level in the nucleus. We show that 
the network can be used to predict the biological 
function and subcellular localization of a protein, 
and to elucidate the function of a disease gene. 
We experimentally verified that granulin precursor 
(GRN) gene, whose mutations cause frontotemporal 
lobar degeneration, is involved in lysosome 
function. We have developed an online tool to 
explore the human and mouse gene networks. 



INTRODUCTION 

Tens of thousands of protein-protein, protein-DNA and 
protein-RNA interactions have been experimentally 
identified in mammalian organisms (1,2). However, they 
constitute only a small part of the cell regulatory network. 
Efforts have been made to infer transcriptional gene 
networks directly from gene expression profiles, using a 
variety of 'reverse-engineering' algorithms (3-9). Among 
the plethora of different approaches to reverse engineer- 
ing, only information-theoretic approaches are applicable 
at the genome scale (10). In these approaches, the network 
among genes is reconstructed by considering pairs of genes 
and checking whether the two genes in each pair are sig- 
nificantly co-regulated across the experimental dataset by 
mutual information (MI), a probabilistic measure of re- 
latedness (11). Significant co-regulations among genes are 
then represented as a network, by connecting two genes 
with an edge if their pairwise MI is significant. 

Since MI measures statistical dependencies between two 
variables, an edge in the network implies a coordinated 
response between the two connected genes, but does not 
necessarily imply causality. Hence, a gene-gene 'connec- 
tion 1 is not necessarily a direct physical interaction 
between the protein products of the two genes, or a tran- 
scription factor (TF)-target gene interaction, but can also 
imply a functional, but indirect, regulation, through one 
or more intermediaries. 

In order to eliminate indirect interactions, the final 
network is usually pruned by removing edges which 
have a higher probability of representing indirect relation- 
ships, using a variety of techniques (4,5). The pruned 
network can then be used to discover TF-target-gene 
interactions (4,5). 
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Another popular way to measure relatedness between 
two genes is correlation that measures co-expression 
between two genes. A limitation of correlation is its 
ability to measure only linear relationships between 
genes (i.e. gene A increases/decreases linearly with 
gene B). However, it fails when relationships are more 
complex (saturation, hysteresys, etc.), whereas MI is not 
affected at all by non-linearities (12). 

Reverse engineering becomes much more powerful as 
the number of gene expression profiles (GEPs) used to 
infer the network increases (10,13). However, the require- 
ment of using homogeneous GEPs (i.e. from a specific cell 
type, tissue or condition) typically limits their number to 
the order of hundreds. There has been a multitude of 
approaches towards integrating heterogeneous gene ex- 
pression profiles from multiple experiments (14-16). Two 
main strategies can be recognized: (i) a 'pluribus unum' 
approach, where the different GEPs within each experi- 
ment are processed as if they were part of a single massive 
experiment. The disadvantage of this approach is that nor- 
malization of large heterogeneous datasets forces expres- 
sion values to be comparable across conditions even if 
they are not; moreover, only around half of expression 
datasets in public repositories contain unprocessed data 
(e.g. Affymetrix CEL file), which are indeed needed for 
normalization; (ii) a 'divide and conquer' approach, 
where each experiment is used independently to compute 
a measure of co-regulation among genes of interest. This 
measure is then averaged out across the different experi- 
ments. The disadvantages are 2-fold: a loss of informa- 
tion, since experiments may differ considerably in the 
number of expression profiles, thus leading to discard 
some experiments due the paucity of samples; and a loss 
in the precision of the computed 'co-regulation' measure, 
due to the fragmentation of the dataset. 

An example of the 'pluribus unum' approach can be 
found in Ref. (14), where a collection of 5372 microarrays 
from different tissues and conditions was simultaneously 
normalized together using standard procedures. This is 
considered a significant achievement due to the large 
number of samples analysed. The results were used to 
relate genes to the conditions in which they are over- or 
under-expressed. Examples of the 'divide and conquer' 
approach, are found in Refs (15,16) where the Pearson 
correlation coefficient is measured independently in each 
experiment for each gene. In the study by Lukk et al. (15), 
a final list of genes co-expressed with a gene of interest is 
obtained via rank aggregation methods, whereas in 
Ref. (16) an averaged correlation value across experiments 
is attributed to each gene pair. 

Our starting hypothesis was that, despite the extreme 
heterogeneity of gene expression profiles coming from 
different cell types, tissues and conditions, it is indeed 
possible to infer a meaningful 'consensus' gene network 
from tens of thousands of GEPs, which can then be used 
to investigate cell transcriptome organization and gene 
function. Towards this end, we collected a massive and 
heterogeneous dataset of >20000 GEPs measured in 
human samples from almost 600 different experiments, 
as well as a similar number of GEPs in mouse samples. 
About a third of these, contained only normalized GEPs 



that cannot be directly compared across different 
experiments. 

To infer a gene network from this massive dataset, we 
needed to overcome current limitations of multi- 
experiment intergration approaches and of Mi-based 
reverse-engineering algorithms, which have been applied 
at most to hundreds of GEPs from the same tissue, or cell 
type, and which require normalized GEPs across the 
whole dataset. We, therefore, developed and applied a 
simple but powerful approach to recover a comprehensive 
gene network among most of the known genes from this 
massive dataset. We decided not to apply a 'network 
pruning' step, since we wanted to keep as many meaning- 
ful gene-gene connections as possible, even if these are not 
direct TF-target gene interactions. Indeed, we show that 
the resulting network is a powerful resource that can be 
used to discover new protein-protein interactions, to gain 
insight into the cell transcriptome organization, and to 
make hypothesis on the biological function of a gene. 

We have developed an online tool (http://netview.tigem 
.it) for querying and exploring the gene network, for both 
human and mouse species. 



MATERIALS AND METHODS 

Gene expression profile dataset, gene network inference 
and computation of pairwise MI 

We collected, from ArrayExpress (17), 20255 GEPs (591 
experiments) in human samples measured with the 
Affymetrix HG-U133A microarray, and 8895 GEPs (614 
experiments) in mouse samples, with the Affymetrix 
Mouse430A_2 (Supplementary Table SI). Normalized 
GEPs within each experiment were retrieved from the 
database. The two mammalian dataset were analysed sep- 
arately. The expression values of all GEPs within an 
experiment were discretized into a predetermined 
number n of bins (n = 3), with equal number of values 
in each bin. The bins are determined using the n quantiles 
of the normalized expression values as cut points. Each 
expression value is then replaced by an integer value cor- 
responding to the bin it falls into (see also Supplementary 
Data). 

In this way, we could use all the available experiments 
even when unprocessed data were not available in the 
public repository (~30% of the experiments have no un- 
processed data). Moreover, this simple technique works 
even when different normalization algorithms are used. 
The discretized values of gene expression are used to 
compute a 'co-occurency' matrix between each pair of 
probe-sets in each experiment, which is then used to 
estimate the joint probability distribution across the 
experiments. 

Specifically, we considered two discrete random vari- 
ables/and /assuming values in the set {1, 2, 3} describing 
the discretized expression values of two probe-sets / and /. 
In this context, MI can be defined as: 



(=1 j=l n, + n 



(1) 
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where ny represents the joint probability P(I = i, J = j) 
with (i, j)e{l, 2, 3} x {1, 2, 3} and n i+ = J^/Ky and 
K+j = 2^; 7C ;/ are ' respectively, their marginal probabilities 
P(I= 0 and P(J = j). 

In order to estimate the joint probability jc#, we used a 
simple frequentist approach: let «| be the counts of the 
outcomes (/ = i, J = j) across the n k GEPs of experiment 
k, then the frequency fty can be estimated jointly from the 
all the K experiments: 



- _ Lk=\ n ij _ riij 

This leads to a point estimate of MI equal to 
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(2) 
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The computational complexity of our algorithm is 
o(N 2 -n), where N is the number of probe-sets and n is 
the total number of GEPs. 

The choice of a different number of discretization bins 
does not considerably affects the results (Supplementary 
Data and Supplementary Figure S4). The MI significant 
threshold was obtained by fitting a Gamma distribution to 
the values of the MI across all the probes' pairs (18) and 
selecting only MI with a P<0.01. 

Yeast two hybrid assays 

The Yeast two Hybrid (Y2H) kit 'ProQuest Two-Hybrid 
System' (Invitrogen) included the Saccharomyces 
cerevisiae MAV 203 strain (MATa, Ieu2-3,H2, trp 1-901, 
his3A200, ade2-101, gal4A, gal80A, SPAL10::URA3, 
GALl::lacZ, HIS3UAS GAL1::HIS3@LYS2, canlR, 
cyh2R), the bait vector pDEST32 and the prey vector 
pDEST22. The 'Ultimate ORF' (Invitrogen) of the genes 
of interest were used to generate prey and bait plasmids 
using the Gate Way technology and protein-protein inter- 
action assays were performed according to the manufac- 
turer the instructions, along with the appropriate positive 
and negative control (Supplementary Table S5 and 
Supplementary Data). 

Cell culture and transfection 

For GRN functional assays, HeLa cells were cultured in 
DMEM supplemented with 10% FBS and treated for 96 h 
in the presence of sucrose to a final concentration of 
100 mM with daily changes of medium. Cells were col- 
lected and analysed by immunofluorescence and 
RealTime PCR (Supplementary Data). For TFEB or 
GRN over-expression 500 000 HeLa cells were transfected 
with 4 ug of DNA expressing the human TFEB or GRN 
using lipofectamine transfection reagent (Invitrogen), and 
collected after 48 h (Supplementary Data). 

Electron microscopy 

GRN or EGFP over-expressing HeLa cells were fixed in 
1% glutaraldehyde in 0.2 M HEPES buffer and further 
incubated in uranyl acetate and in Os04. After dehydra- 
tion through a graded series of ethanol, the cells were 



cleared in propylene oxide, embedded in the Epoxy 
medium (Epon 812) and polymerized at 60° C for 72 h. 
From each sample, thin sections were cut with a Leica 
EM UC6 ultramicrotome. Electron microscopic images 
were acquired from thin sections using an FEI Tecnai-12 
electron microscope equipped with an ULTRA VIEW 
CCD digital camera (FEI, Einhoven, The Netherlands). 
Quantification of the lysosome-like organelle dimensions 
was performed using the AnalySIS software (Soft Imaging 
Systems GmbH, Munster, Germany). 



RESULTS 

We collected 20255 GEPs from 591 different experiments 
(Supplementary Table SI) performed in a variety of 
human tissues, cell types and conditions from public 
microarray repositories. A total of 22 283 different tran- 
scripts were measured, corresponding to the probe sets in 
the Affymetrix HG-U133A microarray. Our aim was to 
reverse engineer a consensus transcriptional gene network 
among the 22 283 transcripts from these multiple experi- 
ments. However, only ~70% of these experiments con- 
tained unprocessed data (CEL file), whereas the 
remaining 30% contained just normalized GEPs. 

We could not apply 'off-the-shelf (Supplementary 
Data) state-of-the-art approaches to reverse engineering 
(5,19), because gene expression profiles are not compar- 
able to each other across experiments in different tissues, 
cell types or conditions. For example, one of the most 
successful reverse-engineering algorithms based on MI 
[ARACNE (5)] is limited by a considerable loss of preci- 
sion, if the 'divide and conquer' approach to multiple ex- 
periments is used (Supplementary Data). 

To overcome these problems, we developed a simple MI 
reverse-engineering approach, which works in two steps: 
(i) by discretizing normalized gene expression values 
within each experiment and computing for each pair of 
genes a 'co-occurrency' matrix, which summarizes the co- 
herence in gene expression changes across the samples in 
the same experiment; (ii) by estimating MI from the whole 
dataset at once, summing up the co-occurrency matrices 
across all the experiments, thus avoiding loss in precision 
due to dataset fragmentation (Supplementary Data). We 
then generated a network by connecting two genes in the 
network, if the associated MI was greater than a signifi- 
cance threshold (P<0.01) obtained by fitting a Gamma 
distribution to the MI values, which has been shown to be 
a good approximation under the null hypothesis of statis- 
tical independence (18). 

Our approach can in principle be applied also to GEPs 
obtained from RNA-seq technology, but at the moment 
their number is much smaller than the GEPs obtained 
from microarrays. 

The inferred network consists of 4 817 629 connections 
among 22283 transcripts (probe-set on the microarray). 
To our knowledge, this is the largest and most compre- 
hensive dataset ever used to reverse engineer a transcrip- 
tional gene network. 
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Connections in the gene network are biologically relevant 

To validate at least a subset of predictions, we generated a 
reference human interactome, Golden Standard (GoS), 
consisting of 105 588 experimentally verified interactions 
including protein-protein (2), TF-target gene and meta- 
bolic interactions (1). 

Since the GoS interactome is at the gene/protein level, 
whereas the network we inferred is at the 'probe-set' level, 
we first needed to transform probe-sets to the correspond- 
ing genes. 

In order to reduce as much as possible cross- 
hybridization issues known to affect microarray technol- 
ogy (20), we first removed 18.8% (4187) of the probe-sets 
that could not be reliably mapped to the coding genome, 
or that mapped to multiple genes (20). 

After this step, only probes associated to a single gene 
were retained. However, some genes can be associated to 
more than one probe-set. We verified that 35% (4200/ 
11978) of the genes represented on the microarray are 
indeed associated to multiple probe-sets (Supplementary 
Table S2). We checked for the consistency between 
probe-sets associated to the same gene by verifying that 
these probe-sets were indeed connected in the probe-set 
level network. We found that 69% (2884/4200) of the 
genes mapped by multiple probes are consistent 
(Supplementary Table S3). Out of the remaining 31% of 
the genes, 14% (596/4200) are associated to probe-sets 
that target alternative transcripts of the same gene, 
which may not be well co-regulated, and hence were 
retained in the analyis. The remaining 1243 of the 
probe-sets corresponding to 540 genes were removed 
from further analysis. 

We then generated a gene-level network from the 
probe-set level network by connecting two genes with an 
edge, if the corresponding probe-sets were connected in 
the probe-set level network. When one, or both genes 
were associated to more than one probe-set, we required 
that at least one probe-set pair was connected in the 
probe-set level network, and assigned as the MI value of 
the gene pair, the maximum MI among the corresponding 
probe-set pairs (21). 

Supplementary Figure Sla shows the percentage of 
inferred connections that were confirmed by the GoS 
interactome. The network reaches a maximum of 90% 
of correct predictions, with an average precision of 32%. 
We estimated the percentage of correct connections, had 
these been randomly guessed, to be equal to 0.0028%, as 
described in Ref. (10). The GoS interactome includes only 
a subset of the interactions occurring in a cell, because 
only a small subset of these have been experimentally 
verified. Moreover, a high MI does not necessarily imply 
physical interactions. Nevertheless, the GoS interactome 
provides evidence of the biological significance of the 
inferred connections. 

For comparison, Supplementary Figure Sla also shows 
the percentage of correctly predicted connections by 
CoexpresDB (16), a database of human co-expressed 
genes measured using a classic Pearson Correlation 
Coefficient (PCC) from multiple experiments using the 
'divide and conquer' strategy. 



Connections tend to be conserved across species 

To understand whether, and to what extent, connections 
among genes are conserved across human and mouse 
species, we collected 8895 gene expression profiles from 
614 experiments in mouse, each measuring 45 101 tran- 
scripts, corresponding to the probe-sets in the 
Affymetrix Mouse430A_2 microarray. We then applied 
our approach to this massive dataset to reconstruct the 
network. Out of all the possible 1 017 027 550 gene-gene 
connections, only 14461 095 connections among 45 101 
transcripts were deemed significant (_P<0.01). The differ- 
ence in the number of inferred connections between the 
human and mouse network is due to the different number 
of probe sets between the two microarray models 
analysed. 

In order to compare the two networks, we first removed 
from the human network those genes without an ortholog 
in mouse, resulting in a 'reduced' network of 1 1 318 genes. 
We then found that 218 700 connections (12%) were 
conserved in mouse (Supplementary Figure Sib). This per- 
centage is in line with previous studies involving a limited 
number of known protein-protein or protein-DNA inter- 
actions. In yeast, it has been reported that between 10% 
(22) and 30% (23) of protein-protein interactions occurring 
during the cell cycle of S. pombe (fission yeast) and 
5. cerevisiae (budding yeast) are conserved; another 
cross-species (fly and yeast) protein interaction study (24) 
resulted in a ratio of conservation ranging from 6% to 
15%; in Ref. (25), the authors report a database of 
protein-protein interactions occurring among transcription 
factors in human and mouse, where the percentage of 
effective conserved interactions is ~16% (the estimated 
range is 34-64% when taking into account the false 
positive rate of the experimental technique). A recent 
genome-wide analysis (26) integrating heterogeneous sets 
of experimental data (including 338 expression profiles in 
human and 1048 in mouse) showed a conservation of 15% 
of interactions between the two mammalian species. 

Prediction of new protein-protein interactions 

We investigated the identity of the top one thousand con- 
nections with the highest MI in the network (Figure 1, 
Supplementary Figure S2, and Supplementary Table S4). 
Forty per cent of these connections, involving a total of 
302 genes, were confirmed by the GoS interactome. An 
additional 13% of the connections were predicted 
among genes in the same gene-family, which, therefore, 
may well be functionally related, although not physically 
interacting. 

In order to test the predictive ability of the network, we 
focused on the subnetwork (b) in Figure 1 consisting of 12 
genes most of which (CENPF, NUSAP1, KIF2C, BUB1B, 
ASPM, ZWINT and CCNB2) involved in mitotic spindle 
checkpoint, chromosome motility and mitotic progression 
(27-30). 

According to the GoS interactome, three protein- 
protein interactions were known to occur among the 
genes in subnetwork (b), therefore we decided to verify 
whether the predicted connections, could be yet undiscov- 
ered protein-protein interactions. We selected only the 
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Figure 1. Subnetworks obtained by collecting the top 1000 connections with the highest MI within the network. Subnetwork (a) contains genes that 
codify for the 'ribosomal protein complex', ('Translation', P < 1.0 ). Subnetwork (b) is enriched for genes involved in the 'spindle checkpoint' 
('Nuclear division' P < 3.1~ 7 ), for clarity only a subset of interactions are shown. Subnetwork (c) is enriched for 'metallothionein' genes, a family of 
low molecular weight, heavy metal binding proteins. Subnetwork (d) contains major histocompatibility complex proteins (Antigen processing and 
presentation, / > <7.5 -16 ). Pairs of genes are connected if their MI (probabilistic measure of relatedness) confirms a significant co-regulation. 



subset of seven genes (NUSAP1, KIF2C, BUB1B, ASPM, 
ZWINT, KIAA0101 and RRM2), which according to our 
network formed a tight cluster of genes all connected to 
each other. We performed a series of yeast two hybrid 
(Y2H) assays to test a total of 21 connections (i.e. all 
the possible interactions among seven proteins). 
According to the Y2H assay, 20 of these were positive. 
Since Y2H are known to be prone to false positive detec- 
tions, we also experimentally estimated the precision (true 
positives over true positives plus false positives) of the 
Y2H assay by using appropriate positive and negative 
controls (Supplementary Data). The estimated precision 
resulted to be equal to 77%, hence, at least 15 ( = 77% 
of 20) of the experimentally identified interactions should 
be true positive predictions. This means that we can 
predict new protein-protein interactions with a precision 
of 75% (15 out of 21, Supplementary Table S5). 

The modular structure of the network 

The structure of the network has a typical exponential 
degree distribution (31,32) consisting of a large number 
of genes with very few connections, and very few genes 
with a large number of connections, termed hubs. 



We observed that, as the number of connections of a 
gene increases, so does its average expression level 
(Supplementary Figure S3a,b and Supplementary Data); 
in contrast, the intrinsic protein disorder (33) of its 
protein product significantly decreases (P = 0.009) 
(Supplementary Figure S3c,d and Supplementary Data). 
Protein disorder, defined as the length of the unstructured 
part of a protein, is an important determinant of gene 
dosage sensitivity (34). 

A cell is able to regulate its complex behaviour thanks 
to sets of genes that perform different but coordinated 
functions. We asked whether we could find such function- 
al modules within the network, which could reveal 
how the cell transcriptome is organized. We searched the 
network for modules, which are defined as 'communities' 
and 'rich-clubs' in network theory. A community is a 
group of genes highly inter-connected to each other, but 
with few connections to genes outside the group. A 
'rich-club' can be defined as a 'community of communities', 
i.e. a group of closely inter-connected communities. 

In order to identify communities, we represented the 
network as a matrix, as shown in Figure 2. The matrix 
was obtained by defining each entry m, y as the value of the 
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Figure 2. Modular structure of the network. Adjacency matrix of the network before (a) and after (b) the hierarchical clustering procedure used to 
identify communities. Each dot represents a connection among two genes, that is a matrix entry, whose MI is greater than the significance threshold, 
(a) Genes are sorted according to their chromosomal location. Numbers on the x and y axes indicate chromosomes, (b) Genes are sorted according 
the community they belong to. Square dimensions are proportional to the number of genes in each community. The inset shows an enlargement of an 
area of the adjacency matrix where single dots are visible. 



MI between gene i (in the r'-th row) and gene j (in the y'-th 
column). We observed that genes lying within the same 
chromosome (squares in Figure 2a) tend to be connected 
to each other more often than what would be expected by 
chance. We then applied a hierarchical clustering algo- 
rithm to the matrix representing the network, and thus 
identified 393 communities with more than 4 genes 
(Supplementary Table S6 and Supplementary Data). 

Figure 2 shows the matrix representing the network 
before (a) and after (b) the hierarchical clustering proced- 
ure. Communities appear as dark squares, with genes be- 
longing to the same community grouped together, giving a 
striking check-board pattern. 

We assessed that 36% of these communities are 
enriched for a specific biological function by Gene 
Ontology analysis (Supplementary Table S7). This per- 
centage increases up to 47%, when considering only 
communities composed by >10 genes. We also found 
that 6 out of 393 communities of the network are signifi- 
cantly enriched for disease genes (P < 0.05, Gene Set 
Enrichment Analysis, in Supplementary Data): commu- 
nity number 1, 11, 22, 40, 54 and 96. The most significant 
community, number 40, is composed by genes whose 
protein products localize to the 'lysosome', and is highly 
enriched for disease-genes involved in lysosomal storage 
disorders. Other examples include community 11, whose 
genes are related to cell adhesion and extracellular matrix 
organization, and include disease-genes causing develop- 
mental or cardiovascular defects; community 22, related 
to the immune system and including genes causing related 
disorders; community 54, composed by genes involved in 



oxygen transport and enriched for genes involved in 
haematological disorders. These 'disease communities' 
could contain other yet unknown disease-related genes, 
and could be helpful in identifying candidate genes in 
disease-related loci. 

We observed that communities interact with each other; 
for example community 1, enriched for transmembrane 
receptor activity (P = 2.01 x 10~ 35 ) interacts with commu- 
nity 3, enriched for 'extracellular region' 
(P = 7.33 x 10~ 06 , Figure 2b, arrow), but not to commu- 
nity 2 involved in 'RNA processing'. 

In order to better elucidate community function and 
interactions among them, we defined the interaction 
strength (IS) between two communities as the number of 
connections occurring among genes belonging to the two 
different communities, divided by the expected number of 
connections. The IS is equal to 0 if no connections exist 
among genes belonging to the two different communities. 
We computed the IS between all the pairs of 393 
communities for a total of 77028. Only 5074 pairs of 
communities had an IS >0. Similarly to the gene-wise 
network, also the community-wise network can be repre- 
sented as a matrix. We can, therefore, apply a clustering 
procedure to group the communities into sets of highly 
interconnected communities ('rich-clubs') (35) 
(Supplementary Data). 

We thus obtained 58 rich-clubs. Each rich-club has a 
representative community termed 'exemplar', to which 
all of the other communities in the rich-club are 
connected. The community-wise network is shown in 
Figure 3. 
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Figure 3. Community-wise network. Each node is a community. A color and a number identify each rich-club (i.e. a group of highly interconnected 
communities). The width of each edge reflects the IS between communities. 'Exemplars' are indicated by triangles. Examples of rich clubs: 
(a) communities of genes involved in 'intracellular trafficking'; (b) communities involved in the 'extracellular matrix maintenance, and cell 
mobility'; (c) communities involved in 'immune response'; (d) communities of genes involved in house-keeping functions: 'gene expression' 
(rich-club 197), 'translation' (rich-club 246), 'RNA processing' (rich-club 2) and 'cell cycle' (rich-club 7). 



Communities in our network can be considered as func- 
tional modules consisting of genes whose expression is 
coordinated, and that carry out specific biological func- 
tions. We asked whether these communities could have a 
'physical counterpart 1 in the cell. We investigated whether 
genes that are connected to each other according to our 
network, were also physically close to each other at the 
chromatin level, in the cell nucleus. We used a recent com- 
prehensive 3D topological mapping of chromosomal loci 
physical interactions using an innovative 'Hi-C chromatin 
capture technology (36). 

This map of physical interactions can be represented as 
a matrix (M p ), where each entry my reports the probability 
of the z'-th Mb of the genome to be in physical contact with 
the 7-th Mb, according to 'Hi-C experimental results. 
Figure 4b is a graphical representation of this map for 
chromosome 19. 

In order to compare our network with the physical 
contact probability map, we generated a 'connection 
tendency matrix' (M c ) at 1 Mb resolution from the 
network. Each entry my of this matrix is simply the 
number of connections occurring among genes found 
within the i-th and j-th Mb of the genome, divided by 
the expected number of connections. Figure 4a is a graph- 
ical representation of the network at 1 Mb resolution, 
where red color represents two chromosomal regions 
which contain more connections than expected among 
the genes they harbour. 



In both matrices (Figure 4a and b), there is a clear 'plaid 
pattern' highlighting chromosomal regions whose genes 
are strongly connected to each other (red in Figure 4a) 
and regions that are physically close to each other at the 
chromatin level (red in Figure 4b). These regions have a 
striking overlap (correlation = 0.4, P = 7.3 x 10 ) espe- 
cially the p-arm of chromosome 19 (upper left square in 
both matrices), revealing that genes that are physically 
close to each other at the chromatin level tend to be 
'co-regulated' (i.e. have a significant MI) and vice versa. 

By extending this analysis to all the chromosomes, we 
found a significant overlap (correlation significance: 
P<0.01) for all but three chromosomes (9, 20 and 21). 
The significance is also present at the whole genome level, 
when considering also inter-chromosomal interactions. 



Elucidating gene function 

We exploited the information embedded in the network to 
identify gene function, or protein subcellular localization, 
via a guilty-by-association analysis. It consists in assigning 
a function to a gene (or a localization to the encoded 
protein) by checking whether there is a shared function 
among the genes connected to it (or a shared localization 
of their protein products). In what follows, we termed 
'gene neighbours' the set of genes connected to a gene of 
interest. 
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Figure 4. Genes that are co-regulated tend to be physically close at the 3D chromatin level, (a) Connection tendency matrix of chromosome 19. Grey 
stripes highlight regions with no probes designed for the microarray model HG-U133A. A red color indicates two different 1 Mb loci whose 
genes are strongly connected to each other, (b) Physical contact matrix of chromosome 19. Grey stripes highlights chromosomal regions where 
centromeres are located, plus unalignable regions. A red color indicates two different 1 Mb loci that are physically close to each at the chromatin 
level. Physically close regions may also contain genes that are not co-expressed and vice versa: region (I) in (a) has an opposite tendency with respect 
to the corresponding region (I) in (b). This means that regions that are not in physical contact may contain genes that are co-expressed. The opposite 
can also be true, for example region (II) shows that loci physically interacting with each other do not necessarily contain genes that are co-expressed. 



We performed a Gene Ontology Enrichment Analysis 
(GOEA) on the set of gene neighbours for each gene 
(http://netview.tigem.it). We then selected, as a test set, 
18 141 transcripts for which the function/localization was 
known according to their Gene Ontology classification. 
The percentage of correct predictions for each of the 
three GO classes (Biological Process, Molecular 
Function and Cellular Localization) ranged from 59% 
to 71%, respectively. 

We next asked whether the guilty-by-association 
approach could be helpful in generating hypotheses on 
the function of genes involved in genetic diseases. We 
used our approach to identify human disease genes that 
may have a yet undiscovered role in lysosome function 
and organization. 

To this end, we ranked, by their P- value from the 
GOEA, all the genes predicted to be lysosomal, or 
associated with lysosome organization, by our guilty- 
by-association analysis (http://netview.tigem.it and 
Supplementary Table S8). The top ranked genes included 
both lysosomal enzymes and other genes involved in lyso- 
somal function. Among these, at position one, NPC2 
(Niemann-Pick disease, type C2) disease-gene, is known 
to be an intralysosomal gene (37); at position two, we 
found another disease-gene, GRN. Both genes are also 
members of community 40, which is enriched both in 
disease genes and in lysosomal genes. 

Despite extensive studies, the role of GRN is far from 
being understood and it has not been directly linked to 
lysosomal function. GRN is a highly conserved gene 
bearing multiple copies of the cysteine-rich granulin 
motifs. Proteolytic cleavage of the precursor protein by 
extracellular proteases, gives rise to smaller peptide frag- 
ments termed granulins which have been linked to a range 
of biological functions including 'cell division', 'survival', 
and 'migration' (38). Mutations in GRN cause 



frontotemporal lobar degeneration with ubiquitin- 
immunoreactive neuronal inclusions (FTLD-U) (39,40). 
Of note, mutations in NPC2 results in a wide spectrum 
of clinical phenotypes including a form of frontal lobe 
atrophy (41). 

To investigate whether GRN was indeed related to lyso- 
somal function, we first evaluated GRN expression levels 
following sucrose treatment, a known inducer of lyso- 
somal biogenesis (42,43). Following sucrose treatment, 
we observed a 2-fold increase over baseline in GRN 
mRNA levels, along with a 3-fold increase in Cathepsin 
D (CTSD), a lysosomal enzyme used as positive control 
(Figure 5a). We also found an increased GRN 
immunostaining using a specific anti-GRN antibody in 
sucrose-treated cells (data not shown). 

The transcription factor EB (TFEB) has been recently 
identified as the transcription factor controlling most of 
the known lysosomal genes via direct binding to their 
proximal promoter (44), and therefore we next asked 
whether GRN is regulated by this transcription factor. 
We first identified, by bioinformatics analysis, two 
TFEB binding sites upstream of the GRN coding 
sequence (Supplementary Data). We then over-expressed 
TFEB in human cell lines and detected a 3-fold increase in 
GRN mRN A levels, along with a 3-fold increase in CTSD, 
a known target of TFEB, used as a positive control 
(Figure 5b). 

We next over-expressed GRN in HeLa cells to investi- 
gate if this had any effect on lysosomes; as shown in 
Figure 5c, the LAMP1 and LAMP2 signal significantly 
increased when compared to a mock control, or 
over-expression of EGFP, as assessed by immunofluores- 
cence (Supplementary Data). Electron microscopy in 
HeLa over-expressing GRN showed that the increase in 
LAMP1 and LAMP2 is likely due to an increase in the 
size of lysosomes but not in their number (Figure 6). 
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Figure 5. GRN is involved in lysosomal function, (a) GRN and CTSD increase in expression level in HeLa following sucrose treatment, a known 
inducer of lysosomes, as measured by realtime qPCR. (b) Expression level of GRN and of CTSD increase in HeLa cells following TFEB 
over-expression, a transcription factor known to regulate lysosome biogenesis, (c) Immunofluorescence with antibody anti-LAMPl and 
anti-LAMP2, used as a lysosomal marker, of transfected HeLa cells over-expressing GRN, or EGFP and in HeLa cells grown in medium collected 
after GRN, or EGFP over-expression. 



The effect of GRN on lysosomes was observed on the 
great majority of cells, despite transfection efficiency being 
not as extensive. Therefore, we hypothesized that GRN- 
transfected cells could secrete a factor inducing lysosome 
biogenesis in neighbouring cells. To test this hypothesis, 
we collected the medium from HeLa transfected cells 
over-expressing GRN and confirmed GRN protein expres- 
sion in the medium by western blot analysis (data now 
shown). We then used this medium to grow untransfected 
wild-type cells. This resulted in a consistent increase 
(P< 0.003) in LAMP1 and LAMP2 signal compared to 
control as shown in Figure 5c (medium). 



DISCUSSION 

We inferred a network of co-regulated genes from a 
massive gene expression dataset in both human and 
mouse species. We showed that, when properly handled, 
this massive dataset, yields a powerful resource to identify 
both functional and physical interactions among genes, 
and to make hypotheses on transcriptome organization 
and gene function. 

A further advantage of our resource is that it can auto- 
matically integrate gene expression measurements per- 
formed with different technologies, such as microarrays 
and RNA-seq from next-generation sequencing, since 
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Figure 6. Electron microscopy of HeLa cells overexpressing GRN reveals increase in lysosomes size. Electron microscopy of Hela cells after GRN, or 
EGFP over-expression. Lysosomes (indicated in EM images by arrows) appear to be larger in cells that overexpress GRN. Morphometric analysis of 
lysosome diameter (meanSD; n = 0 cells) confirms the increase in the lysosome size in GRN-expressing cells. 



normalization is required only within each experiments 
and not across experiments. 

Genes that are co-regulated in human tend to be 
co-regulated in mouse more than what is expected by 
chance; however, this is not the general trend, since only 
12% of connections are conserved. Our observation adds 
weight to the hypothesis that regulation of gene expression 
may be different between species, even if they share a 
similar proteome. 

The gene network structure is typical of complex 
networks with the presence of hub genes with a large 
number of connections. Our findings suggest that hub 
genes are highly expressed and may have been selected 
to be less dosage sensitive, i.e. not pathological when 
their expression is increased, as confirmed by their 
tendency to have a lower protein disorder. 

We identified biologically relevant functional modules 
within the network, thus providing a modular view of the 
wiring diagram of a cell. Genes connected within function- 
al modules in the network tend to have a 'physical coun- 
terpart' in the 3D conformation of the chromatin. We 
observed a striking similarity between genes that appear 
to be connected, and, therefore, are co-regulated, and their 
physical proximity at the 3D chromatin level. This 
suggests that regulation of coordinated gene expression 
is 'hard-wired' in the physical arrangement of the chroma- 
tin within the nucleus. 

We have shown how the network can be used to identify 
new protein-protein interactions and to investigate the 
function of a disease-gene. 

Upregulation of GRN by known inducers of lysosomal 
biogenesis and function, together with the increase in the 
LAMP1 and LAMP2 signal following GRN over- 
expression, or treatment with medium from GRN 



over-expressing cells, clearly supports a role of GRN in 
lysosome biology. This finding is also supported by 
previous evidence indicating that GRN colocalizes with 
lysosome-associated CD68 antigen in activated macro- 
phages and microglia (45) and is overexpressed in the 
cerebral cortex of MPSIIIB and MPSI mice (46). 
Although GRN may bind the mannose 6-phosphate 
receptor (47,48), more recently it was demonstrated that 
GRN is endocytosed by sortilin and rapidly delivered to 
lysosomes (49). In addition, it has also been shown that 
inhibition of vacuolar ATPase increases intracellular and 
secreted GRN, which again support a lysosome involve- 
ment (50). 

The approach we developed has some intrinsic limita- 
tions, since the MI measure does not allow to recover 
causal relationships between genes, hence we cannot dis- 
tinguish direct from indirect regulation. To overcome this 
limitation, it should be possible to extend the approach to 
infer n-way dependencies, by computing for example con- 
ditional MI, in order to eliminate indirect interations. 
Another limitation is due to the fact that we used a 
variety of tissue and cell types in order to increase the 
statistical power of the inference method; however in so 
doing, we lost the capability of identifying tissue-specific 
(or cell-type specific) co-regulation among genes. Such in- 
formation could be retrieved by first dividing the expres- 
sion dataset according to the sample of origin, and then 
applying a Bayesian framework to determine the probabil- 
ity of each connection of occuring in a specific tissue or 
cell type. 

We have made our resource publicly available as an 
online tool (http://netview.tigem.it). The gene network 
can be easily searched for a gene of interest, queried 
with a Gene Ontology term to detect all the genes with 
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that predicted function, or searched with a list of genes to 
identify one or more common regulators, i.e. genes that 
are significantly co-regulated with most of the genes in the 
query list. 

Additional studies can be easily conducted using our 
online resource. For example, we predicted for each gene 
its function. Specifically, it is possible to search for a 
function of interest, or a cellular compartment and 
identify the TFs predicted to be involved in that 
function, and thus possibly acting as master regulators. 
This analysis could help in hyoptheses generation, which 
should then be followed by ad hoc experimental 
investigation. 

We believe our resource will provide a valuable tool to 
the research community. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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